Introduction

The following dataset was gathered from Kaggle.com and originates from the UCI Machine Learning Repository. The goal of this project is to analyse the effect of various factors on the recurrence of well-differentiated thyroid cancer. The factors are:

  1. Age: The age of the patient at the time of diagnosis or treatment.
  2. Gender: The gender of the patient (male or female).
  3. Smoking: Whether the patient is a smoker or not.
  4. Hx Smoking: Smoking history of the patient (e.g., whether they have ever smoked).
  5. Hx Radiotherapy: History of radiotherapy treatment for any condition.
  6. Thyroid Function: The status of thyroid function, possibly indicating if there are any abnormalities.
  7. Physical Examination: Findings from a physical examination of the patient, which may include palpation of the thyroid gland and surrounding structures.
  8. Adenopathy: Presence or absence of enlarged lymph nodes (adenopathy) in the neck region.
  9. Pathology: Specific types of thyroid cancer as determined by pathology examination of biopsy samples.
  10. Focality: Whether the cancer is unifocal (limited to one location) or multifocal (present in multiple locations).
  11. Risk: The risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type.
  12. T: Tumor classification based on its size and extent of invasion into nearby structures.
  13. N: Nodal classification indicating the involvement of lymph nodes.
  14. M: Metastasis classification indicating the presence or absence of distant metastases.
  15. Stage: The overall stage of the cancer, typically determined by combining T, N, and M classifications.
  16. Response: Response to treatment, indicating whether the cancer responded positively, negatively, or remained stable after treatment.
  17. Recurred: Indicates whether the cancer has recurred after initial treatment.

In this report you will first find a statistical analysis of the dataset, aimed at identifying the variables most influential in thyroid disease recurrence, followed by a number of models developed to predict recurrence from those variables.

This project also aims to determine the efficacy of machine learning models in predicting the recurrence of thyroid disease. Predicting recurrence could allow for better-targeted treatment and diagnosis.

Analysis of Correlations


The correlation matrix shows how the variables relate to one another. Each intersection in the matrix is a square: the bluer the square, the stronger the negative correlation (as one variable increases, the other decreases); the redder the square, the stronger the positive correlation (as one variable increases, so does the other); a white square indicates little to no correlation between the pair. The correlation matrix suggested the following promising variables:

  1. T
  2. N
  3. Gender
  4. Smoking
  5. Age

This means that as each of these variables increases, the recurrence of thyroid disease typically increases. Here are the corresponding correlation tests, performed with the Pearson method, which verifies these correlations with a second statistic. The smaller the p-value, the weaker the case that the association between recurrence and the variable arose purely by chance, i.e. the stronger the evidence that they are genuinely correlated.

T:       1.741654e-32
N:       3.710287e-44
Gender:  4.546791e-11
Smoking: 2.192081e-11
Age:     2.776541e-07
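These p-values come from Pearson correlation tests; the following is a minimal, self-contained sketch of such a test in R, using synthetic data rather than the thyroid dataset:

```r
# Pearson correlation test between a predictor and a binary outcome
# (synthetic data for illustration only)
set.seed(1)
x <- rnorm(100)                      # predictor
y <- as.numeric(x + rnorm(100) > 0)  # outcome loosely driven by x
ct <- cor.test(x, y, method = "pearson")
ct$estimate  # the correlation coefficient
ct$p.value   # a tiny p-value means the association is unlikely to be chance
```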

Analysis

The following bar chart shows the number of recurrences among patients who smoked versus those who did not.

The following bar chart shows the number of recurrences by gender.

The following bar chart shows the number of recurrences by age.

The final two bar charts show recurrences by tumor classification (T) and nodal classification (N).

Stepwise Regression


Here I choose to utilize stepwise regression with the significant variables (T, N, Gender, Smoking, Age) to further narrow the set of predictors.

 (Intercept)            T            N       Gender          Age 
-0.248510736  0.092106846  0.246961804  0.132810517  0.004227633 

Call:
lm(formula = Recurred ~ T + N + Gender + Age, data = encoded_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77382 -0.16654 -0.04562  0.10054  1.00844 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.248511   0.048136  -5.163 3.94e-07 ***
T            0.092107   0.013945   6.605 1.35e-10 ***
N            0.246962   0.021216  11.641  < 2e-16 ***
Gender       0.132811   0.043474   3.055 0.002411 ** 
Age          0.004228   0.001101   3.841 0.000144 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3123 on 378 degrees of freedom
Multiple R-squared:  0.5246,    Adjusted R-squared:  0.5196 
F-statistic: 104.3 on 4 and 378 DF,  p-value: < 2.2e-16
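As a worked example of reading these coefficients, the fitted value is just the linear combination of the estimates with a patient's encoded values. The patient below is hypothetical, chosen purely for illustration:

```r
# Coefficients as reported by the regression summary above
coefs <- c(intercept = -0.248511, T = 0.092107, N = 0.246962,
           Gender = 0.132811, Age = 0.004228)
# Hypothetical encoded patient: T = 3, N = 1, Gender = 1 (male), Age = 40
patient <- c(T = 3, N = 1, Gender = 1, Age = 40)
fitted_val <- unname(coefs["intercept"] + sum(coefs[names(patient)] * patient))
fitted_val  # about 0.58; fitted values nearer 1 lean toward predicted recurrence
```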


The model chosen by the stepwise regression drops the Smoking variable and yields a statistically significant p-value. Despite this, the diagnostic plots show evidence of non-linearity, suggesting the model is not entirely reliable.

Tree Model


Here I fit a regression tree using the ANOVA method, then prune it at the complexity parameter that minimizes the cross-validated error.


Regression tree:
rpart(formula = Recurred ~ T + N + Gender + Smoking + Age, data = encoded_data, 
    method = "anova")

Variables actually used in tree construction:
[1] Age N   T  

Root node error: 77.546/383 = 0.20247

n= 383 

        CP nsplit rel error  xerror     xstd
1 0.378074      0   1.00000 1.00314 0.049671
2 0.077797      1   0.62193 0.67166 0.062079
3 0.055048      2   0.54413 0.61903 0.063729
4 0.022933      3   0.48908 0.55229 0.061454
5 0.011260      4   0.46615 0.53531 0.058504
6 0.010000      5   0.45489 0.58035 0.062015
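The pruning step picks the CP row with the smallest cross-validated error (xerror). With the printed table hardcoded for illustration, the selection looks like this:

```r
# CP and xerror columns copied from the printed CP table above
cp     <- c(0.378074, 0.077797, 0.055048, 0.022933, 0.011260, 0.010000)
xerror <- c(1.00314, 0.67166, 0.61903, 0.55229, 0.53531, 0.58035)
optimal_cp <- cp[which.min(xerror)]
optimal_cp  # 0.01126 -> prune(fit, cp = optimal_cp) keeps the 4-split tree
```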


The following is an analysis of the pruned tree model:
MSE:  0.09438056 
R-squared: 0.5338522 

kNN Classifier


The following is the result of using the k values 1, 3, 5, 7, 9, 15, 19, 25, and 50 in a kNN classifier over all of the predictor variables.



k = 1

This shows that a model using k = 1 is the most accurate; the following is further analysis of that model.

[1] "Accuracy:  0.904411764705882"
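The printed accuracy is simply one minus the misclassification rate on the test set; a toy illustration with made-up predictions:

```r
# Accuracy = 1 - mean(predicted != actual), as computed for the kNN model
pred   <- c(1, 0, 1, 1, 0)  # toy predictions
actual <- c(1, 0, 0, 1, 0)  # toy true labels
acc <- 1 - mean(pred != actual)
acc  # 0.8 -> four of the five toy predictions are correct
```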

Conclusion

TL;DR:

The kNN model proved the most promising due to its high accuracy. The tree and stepwise models, while showing some promising attributes, gave indications that the data were not well suited to them.

Full conclusion:

After fitting a stepwise regression model, a tree model, and a kNN classifier, I believe the best model was the kNN classifier. It showed the highest accuracy of all the models, roughly 0.90, determined after testing multiple values of k, with k = 1 providing the highest accuracy. The tree model posed a few issues. Firstly, its R-squared value was about 0.53, meaning the model explained only about 53% of the variance in recurrence. In addition, the residuals vs fitted plot showed an uneven distribution, casting further doubt on the model. Finally, the stepwise regression model: I had faith in it, since it is designed to tune itself toward the best fit, yet its adjusted R-squared was only about 0.52. It also showed signs of non-linearity in the residuals vs fitted plot, making its predictions less trustworthy. Overall, the kNN classifier proved to be a strong predictor of thyroid disease recurrence.

Author: Sean Theisen

---
title: "Thyroid Disease Analysis"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard) #For dashboard creation
knitr::opts_chunk$set(echo = TRUE)
library(caret) #Useful functions
library(ggplot2) #Visualizations
library(corrplot) #Correlation plots
library(MASS) #stepAIC for stepwise regression
library(gridExtra) #Arranging grid graphics
library(rpart) #Regression trees
library(rpart.plot) #Tree visualization
library(e1071) #Misc ML utilities
library(caTools) #sample.split for train/test splitting
library(class) #knn classifier
library(plotly) #Interactive plots
library(heatmaply) #Interactive correlation heatmap
```

```{r, include=FALSE}
data <- read.csv("C:/Users/seanj/projects/thyroid_dash/data/Thyroid_Diff.csv")
#Label encoding of categorical variables
label_encode <- function(x){
  if(is.factor(x) || is.character(x)){
    as.numeric(factor(x))
  }else{
    x
  }
}

encoded_data <- as.data.frame(lapply(data, label_encode))
#Shift everything to 0-based encoding (note: this also shifts the numeric Age column down by 1)
encoded_data <- as.data.frame(lapply(encoded_data, function(x) x - 1))
```
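As a quick illustration of what the encoding above does to a categorical column (toy vector; `factor()` numbers levels alphabetically):

```r
# Same helper as in the setup chunk above
label_encode <- function(x){
  if(is.factor(x) || is.character(x)) as.numeric(factor(x)) else x
}
label_encode(c("No", "Yes", "No"))      # 1 2 1
label_encode(c("No", "Yes", "No")) - 1  # 0 1 0, after the 0-based shift
```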

# Introduction
The following dataset was gathered from Kaggle.com and originates from the UCI Machine Learning Repository.
The goal of this project is to analyse the effect of various factors on the recurrence of well-differentiated thyroid cancer. The factors are:


1. Age: The age of the patient at the time of diagnosis or treatment.
2. Gender: The gender of the patient (male or female).
3. Smoking: Whether the patient is a smoker or not.
4. Hx Smoking: Smoking history of the patient (e.g., whether they have ever smoked).
5. Hx Radiotherapy: History of radiotherapy treatment for any condition.
6. Thyroid Function: The status of thyroid function, possibly indicating if there are any abnormalities.
7. Physical Examination: Findings from a physical examination of the patient, which may include palpation of the thyroid gland and surrounding structures.
8. Adenopathy: Presence or absence of enlarged lymph nodes (adenopathy) in the neck region.
9. Pathology: Specific types of thyroid cancer as determined by pathology examination of biopsy samples.
10. Focality: Whether the cancer is unifocal (limited to one location) or multifocal (present in multiple locations).
11. Risk: The risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type.
12. T: Tumor classification based on its size and extent of invasion into nearby structures.
13. N: Nodal classification indicating the involvement of lymph nodes.
14. M: Metastasis classification indicating the presence or absence of distant metastases.
15. Stage: The overall stage of the cancer, typically determined by combining T, N, and M classifications.
16. Response: Response to treatment, indicating whether the cancer responded positively, negatively, or remained stable after treatment.
17. Recurred: Indicates whether the cancer has recurred after initial treatment.

In this report you will first find a statistical analysis of the dataset, aimed at identifying the variables most influential in thyroid disease recurrence, followed by a number of models developed to predict recurrence from those variables.

This project also aims to determine the efficacy of machine learning models in predicting the recurrence of thyroid disease. Predicting recurrence could allow for better-targeted treatment and diagnosis.


# Analysis of Correlations

## Column
```{r, echo=FALSE, out.width="90%", out.height="100%"}
#Statistical Analysis
res <- cor(encoded_data)
heatmaply_cor(res, show_dendrogram=c(FALSE, FALSE))
```

## Column

The correlation matrix shows how the variables relate to one another. Each intersection in the matrix is a square: the bluer the square, the stronger the negative correlation (as one variable increases, the other decreases); the redder the square, the stronger the positive correlation (as one variable increases, so does the other); a white square indicates little to no correlation between the pair. The correlation matrix suggested the following promising variables:

1.  T
2.  N
3.  Gender
4.  Smoking
5.  Age

This means that as each of these variables increases, the recurrence of thyroid disease typically increases.
Here are the corresponding correlation tests, performed with the Pearson method, which verifies these correlations with a second statistic. The smaller the p-value, the weaker the case that the association between recurrence and the variable arose purely by chance, i.e. the stronger the evidence that they are genuinely correlated.

```{r, echo=FALSE}
#Named vector of Pearson p-values for each candidate variable against Recurred
sapply(c("T", "N", "Gender", "Smoking", "Age"),
       function(v) cor.test(encoded_data[[v]], encoded_data$Recurred)$p.value)
```


# Analysis {.storyboard}

```{r, echo=FALSE}
recurred <- subset(encoded_data, Recurred == 1)
#Keep only the recurred patients and the variables of interest, so the
#charts below show counts among patients whose cancer recurred
correlated_df <- recurred[, c("Recurred", "Gender", "Smoking", "Age", "T", "N")]
```

### The following bar chart shows the number of recurrences among patients who smoked versus those who did not
```{r, echo=FALSE, out.width="50%"}
smoking_counts <- table(correlated_df[["Smoking"]])
names(smoking_counts) <- c("No", "Yes")
plot2 <- barplot(smoking_counts, main = "Smoking")
```

### The following bar chart shows the number of recurrences by gender
```{r, echo=FALSE, out.width="50%"}
gender_counts <- table(correlated_df[["Gender"]])
names(gender_counts) <- c("F", "M")
plot1 <- barplot(gender_counts, main = "Gender")
```

### The following bar chart shows the number of recurrences by age
```{r, echo=FALSE, out.width="50%"}
age_counts <- table(correlated_df[["Age"]])
plot3 <- barplot(age_counts, xlab="Age", ylab="Freq", main="Age")
```

### The two bar charts below show recurrences by tumor classification (T) and nodal classification (N)
```{r, echo=FALSE, out.width="50%"}
T_counts <- table(correlated_df[["T"]])
plot4 <- barplot(T_counts, main="T")

N_counts <- table(correlated_df[["N"]])
plot5 <- barplot(N_counts, main="N")
```




# Stepwise Regression

## Column

Here I choose to utilize stepwise regression with the significant variables (T, N, Gender, Smoking, Age) to further narrow the set of predictors.

```{r, echo=FALSE}
full.model <- lm(Recurred ~ T + N + Gender + Smoking + Age, data=encoded_data) #avoid shadowing base lm()
step.model <- stepAIC(full.model, direction="both", trace=0) #stepwise selection in both directions
srm <- lm(Recurred ~ T + N + Gender + Age, data=encoded_data) #the model stepAIC selected (Smoking dropped)
step.model$coefficients
summary(srm)
```

## Column {data-width=500}

The model chosen by the stepwise regression drops the Smoking variable and yields a statistically significant p-value. Despite this, the diagnostic plots show evidence of non-linearity, suggesting the model is not entirely reliable.

```{r, out.width="50%", echo=FALSE}
#plot.lm() draws base graphics, which grid.arrange() cannot combine,
#so lay out the four diagnostic plots with par(mfrow) instead
par(mfrow = c(2, 2))
plot(srm, which = 1:4)
```

# Tree Model

## Column
Here I fit a regression tree using the ANOVA method, then prune it at the complexity parameter that minimizes the cross-validated error.

```{r, echo=FALSE}
fit <- rpart(Recurred ~ T + N + Gender + Smoking + Age, data=encoded_data, method="anova")
fit_cp <- printcp(fit)
optimal_cp <- fit_cp[which.min(fit_cp[,"xerror"]),"CP"]
pruned_fit <- prune(fit, cp = optimal_cp)
rpart.plot(pruned_fit)
```

## Column
The following is an analysis of the pruned tree model:
```{r, echo=FALSE}
pred <- predict(pruned_fit, encoded_data)
mse <- mean((encoded_data$Recurred - pred)^2)
rsq <- 1 - sum((encoded_data$Recurred - pred)^2) / sum((encoded_data$Recurred - mean(encoded_data$Recurred))^2)
cat("MSE: ", mse, "\nR-squared:", rsq, "\n")
```

```{r, echo=FALSE}
par(mfrow = c(2, 2))
resid_tree <- encoded_data$Recurred - pred

# Residuals vs Fitted (plot() takes the residuals as its y argument)
plot(pred, resid_tree, main = "Residuals vs Fitted",
     xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = "red")

# Q-Q Plot of Residuals
qqnorm(resid_tree)
qqline(resid_tree, col = "red")

# Scale-Location Plot
plot(pred, sqrt(abs(resid_tree)), main = "Scale-Location",
     xlab = "Fitted", ylab = "sqrt(|Residuals|)")

# Cook's Distance (influence check via a full linear model)
cooksd <- cooks.distance(lm(Recurred ~ ., data = encoded_data))
plot(cooksd, main = "Cook's Distance")
abline(h = 4/length(encoded_data$Recurred), col = "red")
```

# kNN Classifier
## Column
### The following is the result of using the k values 1, 3, 5, 7, 9, 15, 19, 25, and 50 in a kNN classifier over all of the predictor variables.

```{r, echo=FALSE}
set.seed(42) #arbitrary seed so the split (and accuracies) are reproducible
#sample.split() expects the label vector, not the whole data frame
split <- sample.split(encoded_data$Recurred, SplitRatio=.7)
train_cl <- encoded_data[split, ]
test_cl <- encoded_data[!split, ]
#feature columns only; column 17 is the Recurred label itself
train_scale <- train_cl[, 1:16]
test_scale <- test_cl[, 1:16]
k_values <- c(1,3,5,7,9,15,19,25,50)
accuracy_values <- sapply(k_values, function(k){
  classifier_knn <- knn(train = train_scale,
                        test = test_scale,
                        cl = train_cl$Recurred,
                        k=k)
  1-mean(classifier_knn != test_cl$Recurred)
})
accuracy_data <- data.frame(K = k_values, Accuracy=accuracy_values)
ggplot(accuracy_data, aes(x = K, y = Accuracy)) +
  geom_line(color = "lightblue", size = 1) +
  geom_point(color = "lightgreen", size = 3) +
  labs(x = "Number of Neighbors (K)",
       y = "Accuracy") +
  theme_minimal()
```

## Column
### k = 1

This shows that a model using k = 1 is the most accurate; the following is further analysis of that model.

```{r, echo=FALSE}
classifier_knn <- knn(train = train_scale,
                        test = test_scale,
                        cl = train_cl$Recurred,
                        k=1)
acc <- 1-mean(classifier_knn != test_cl$Recurred)
cm <- table(test_cl$Recurred, classifier_knn) #confusion matrix (rows = actual, cols = predicted)
print(paste("Accuracy: ", acc))
print(cm)
plot(classifier_knn, col=rainbow(2), xlab="Recurrence (0=No, 1=Yes)")
```

# Conclusion

TL;DR:

The kNN model proved the most promising due to its high accuracy. The tree and stepwise models, while showing some promising attributes, gave indications that the data were not well suited to them.

Full conclusion:

After fitting a stepwise regression model, a tree model, and a kNN classifier, I believe the best model was the kNN classifier. It showed the highest accuracy of all the models, roughly 0.90, determined after testing multiple values of k, with k = 1 providing the highest accuracy.
The tree model posed a few issues. Firstly, its R-squared value was about 0.53, meaning the model explained only about 53% of the variance in recurrence. In addition, the residuals vs fitted plot showed an uneven distribution, casting further doubt on the model.
Finally, the stepwise regression model: I had faith in it, since it is designed to tune itself toward the best fit, yet its adjusted R-squared was only about 0.52. It also showed signs of non-linearity in the residuals vs fitted plot, making its predictions less trustworthy.
Overall, the kNN classifier proved to be a strong predictor of thyroid disease recurrence.

Author: Sean Theisen